The analysis pipelines used by IGSR build on those created for the 1000 Genomes Project. For more detailed information about the analysis methods used by the 1000 Genomes Project in its different phases, please refer to our publications.
IGSR employs an alt-aware alignment strategy, using the most recent version of BWA-mem, when aligning data to GRCh38. This uses the full GRCh38 reference, including ALT contigs, decoy and EBV sequences (accession GCA_000001405). In addition, more than 500 HLA sequences, compiled by Heng Li from the IMGT/HLA database provided by the Immuno Polymorphism Database (IPD), are included.
The pipeline aligns sequence data at the run level and then merges runs belonging to the same sample together to produce sample level alignments. GATK BAM improvement steps are used, as in the 1000 Genomes phase 3 pipeline. By using the complete GRCh38 genome, we should have improved read mapping accuracy, providing a better foundation for further analyses.
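The run-then-sample structure described above can be sketched in a few lines of Python. This is only an illustration of the grouping step, not the IGSR pipeline code: the function name `group_runs_by_sample`, the sample and run identifiers, and the file names are all hypothetical, and the actual merging of BAM files is left to whatever tool the pipeline uses.

```python
from collections import defaultdict


def group_runs_by_sample(run_records):
    """Group run-level BAM paths by sample, preserving input order.

    run_records is an iterable of (sample, run_id, bam_path) tuples.
    Returns a dict mapping each sample to the list of run-level BAMs
    that would be merged into its sample-level alignment.
    """
    by_sample = defaultdict(list)
    for sample, _run_id, bam_path in run_records:
        by_sample[sample].append(bam_path)
    return dict(by_sample)


# Hypothetical run records: (sample, run identifier, run-level BAM path)
runs = [
    ("SAMPLE_A", "RUN_001", "RUN_001.bam"),
    ("SAMPLE_B", "RUN_002", "RUN_002.bam"),
    ("SAMPLE_A", "RUN_003", "RUN_003.bam"),
]

merge_plan = group_runs_by_sample(runs)
for sample, bams in merge_plan.items():
    # Each line lists the run-level inputs for one sample-level merge
    print(sample, "<-", ", ".join(bams))
```

Aligning per run first means a new run for an existing sample only requires aligning that run and redoing the merge for that sample.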
Information on alt-aware BWA can be found on the bwa site.
During the main 1000 Genomes Project, sequence reads were aligned to GRCh37. In phase 1, the reference as provided by the Genome Reference Consortium was used. In phase 3, decoy sequence was added to the reference to reduce the rate of mismapping.
The phase 1 reference FASTA can be found in the technical/reference directory. It represents the full chromosomes of the GRCh37 build of the human reference. The phase 3 reference can be found in the phase2_reference_assembly_sequence directory. This contains both the full reference and the additional decoy sequence.
In the pilot phase of the 1000 Genomes Project, the data was mapped to sex-matched copies of NCBI36. Our reference files can be found under the pilot_data directory.
During the 1000 Genomes Project, different mapping algorithms were used for different data types. The table below describes which algorithms were used for each combination of data type and sequencing technology in the different phases of the project.
Phase | Technology | Low Coverage | Exome/Exon Targeted | High Coverage |
---|---|---|---|---|
Pilot | Illumina | MAQ | MOSAIK | MAQ |
Pilot | 454 | MAQ | MOSAIK | MAQ |
Pilot | SOLiD | corona | N/A | corona |
Phase 1 | Illumina | bwa | bfast | N/A |
Phase 1 | 454 | bwa | bfast | N/A |
Phase 1 | SOLiD | bfast | bfast | N/A |
Phase 3 | Illumina | bwa | N/A | N/A |
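
For scripted use, the table above can be encoded as a simple lookup. The following is a minimal sketch that only transcribes the table; the names `ALIGNERS` and `aligner_for` and the column keys are hypothetical and not part of any IGSR tool. Entries listed as N/A in the table are represented as `None`.

```python
# Columns of the table, in order: low coverage, exome/exon targeted, high coverage
COLUMNS = ("low_coverage", "exome", "high_coverage")

# Transcription of the table above; None stands for N/A
ALIGNERS = {
    ("Pilot", "Illumina"): ("MAQ", "MOSAIK", "MAQ"),
    ("Pilot", "454"): ("MAQ", "MOSAIK", "MAQ"),
    ("Pilot", "SOLiD"): ("corona", None, "corona"),
    ("Phase 1", "Illumina"): ("bwa", "bfast", None),
    ("Phase 1", "454"): ("bwa", "bfast", None),
    ("Phase 1", "SOLiD"): ("bfast", "bfast", None),
    ("Phase 3", "Illumina"): ("bwa", None, None),
}


def aligner_for(phase, technology, data_type):
    """Return the aligner named in the table, or None if N/A or not listed."""
    row = ALIGNERS.get((phase, technology))
    if row is None or data_type not in COLUMNS:
        return None
    return row[COLUMNS.index(data_type)]


print(aligner_for("Phase 1", "SOLiD", "low_coverage"))
print(aligner_for("Pilot", "SOLiD", "exome"))
```

Combinations that do not appear in the table, such as phase 3 data from a technology other than Illumina, also return `None`.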
Over the course of the 1000 Genomes Project, the way variants were called from the samples changed dramatically. Two clear lessons emerged from the project. First, for low coverage data, calling from multiple samples at once produces more, higher quality variants. Second, combining the sites discovered by multiple algorithms improves both the discovery rate and the accuracy of discovery. Many different programs and strategies were developed over the duration of the project. The publications referred to at the top of this page are the best place to find a description of which programs were used and how the 1000 Genomes variant calling pipeline was run.